Add node pfail and fail count to cluster info metrics #1910
Signed-off-by: Harkrishn Patro <[email protected]>
zuiderkwast
left a comment
Makes sense.
So the use case is to be able to write tests more reliably, or is there a "real" use case?
I'm trying to observe the time until a node failure is first detected (PFAIL) and the time until the node is marked as completely failed (FAIL). Without this data, it is difficult to modify the algorithm and observe the change in behavior.
Observability of failure detection. It's a great concept! ;) Yeah, it can be useful for users too, not only for us.
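One way to measure the two intervals discussed here is to sample CLUSTER INFO periodically and note when each counter first becomes non-zero. The sketch below is only an illustration and is not part of this PR: the function name `first_detection_times` and the sample format (a series of `(timestamp, fields)` pairs) are assumptions, and it relies on the `cluster_nodes_pfail` and `cluster_nodes_fail` fields proposed here.

```python
from typing import Dict, Iterable, Optional, Tuple


def first_detection_times(
    samples: Iterable[Tuple[float, Dict[str, int]]],
    start: float,
) -> Tuple[Optional[float], Optional[float]]:
    """Return (seconds until first PFAIL, seconds until first FAIL).

    `samples` is a hypothetical series of (timestamp, parsed CLUSTER INFO
    fields) pairs in increasing time order; `start` is the time the node
    was taken down. A value is None if that state was never observed.
    """
    t_pfail: Optional[float] = None
    t_fail: Optional[float] = None
    for ts, fields in samples:
        # Record only the first sample where each counter is non-zero.
        if t_pfail is None and fields.get("cluster_nodes_pfail", 0) > 0:
            t_pfail = ts - start
        if t_fail is None and fields.get("cluster_nodes_fail", 0) > 0:
            t_fail = ts - start
        if t_pfail is not None and t_fail is not None:
            break
    return t_pfail, t_fail
```

The resolution is limited by the sampling interval: if a node moves from PFAIL to FAIL between two samples, the PFAIL time is reported as `None`.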
Signed-off-by: Harkrishn Patro <[email protected]>
The test seems flaky. Looking into it.
Signed-off-by: Harkrishn Patro <[email protected]>
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #1910      +/-   ##
============================================
- Coverage     71.03%   70.99%   -0.05%
============================================
  Files           123      123
  Lines         65682    65721      +39
============================================
- Hits          46660    46656       -4
- Misses        19022    19065      +43
zuiderkwast
left a comment
LGTM.
@valkey-io/core-team Please ack ( 👍 ) two new fields in CLUSTER INFO.
madolson
left a comment
Do we think having this information is better than just asking end users to run cluster nodes/shards and count the number of failed/pfail nodes? I'm a little worried about end users alarming on this metric, even though it includes nodes that aren't part of quorum and aren't serving any traffic.
With a large cluster, I would prefer not to pull the CLUSTER NODES/SHARDS output and compute this.
Signed-off-by: Harkrishn Patro <[email protected]>
I've added voting-node pfail/fail counts as well. If we decouple voting nodes from data-serving nodes (primaries) within the same architecture in the future, we will have to add two additional metrics (primary_fail / primary_pfail). @madolson let me know your thoughts.
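For comparison, the client-side alternative raised in this thread (counting flags in CLUSTER NODES output) could look roughly like the sketch below. It is not code from this PR. The function name `count_failures` is hypothetical, and two things are assumptions: that the flags column uses `fail?` for PFAIL, `fail` for FAIL and `master` for primaries, and that a "voting" node is a primary owning at least one slot.

```python
from typing import Dict


def count_failures(cluster_nodes_output: str) -> Dict[str, int]:
    """Count PFAIL/FAIL nodes, overall and among voting nodes.

    Assumption: a voting node is a primary that owns at least one slot.
    """
    counts = {
        "nodes_pfail": 0,
        "nodes_fail": 0,
        "voting_nodes_pfail": 0,
        "voting_nodes_fail": 0,
    }
    for line in cluster_nodes_output.splitlines():
        parts = line.split()
        if len(parts) < 8:
            continue  # skip blank or malformed lines
        flags = set(parts[2].split(","))
        # Entries after the eighth field are slots or slot ranges;
        # bracketed entries describe migrating/importing slots, so skip them.
        owned = [p for p in parts[8:] if not p.startswith("[")]
        is_voting = "master" in flags and len(owned) > 0
        if "fail?" in flags:
            counts["nodes_pfail"] += 1
            counts["voting_nodes_pfail"] += int(is_voting)
        if "fail" in flags:
            counts["nodes_fail"] += 1
            counts["voting_nodes_fail"] += int(is_voting)
    return counts
```

This has to transfer and parse one line per known node on every poll, which is the cost the new CLUSTER INFO fields are meant to avoid on large clusters.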
Signed-off-by: Harkrishn Patro <[email protected]>
Code changes: valkey-io/valkey#1910 Signed-off-by: Harkrishn Patro <[email protected]>
New fields in CLUSTER INFO:

* `cluster_nodes_pfail`
* `cluster_nodes_fail`
* `cluster_voting_nodes_pfail`
* `cluster_voting_nodes_fail`

I'm running a few tests and trying to capture the partially failed and completely failed node counts. Slot-level partially failed / completely failed stats exist, but it is more difficult to assess the node failure count with those.

New output:

```
> CLUSTER INFO
cluster_state:fail
cluster_slots_assigned:0
cluster_slots_ok:0
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_nodes_pfail:1
cluster_nodes_fail:0
cluster_voting_nodes_pfail:1
cluster_voting_nodes_fail:0
cluster_known_nodes:3
cluster_size:0
cluster_current_epoch:1
cluster_my_epoch:1
cluster_stats_messages_ping_sent:2104
cluster_stats_messages_pong_sent:1906
cluster_stats_messages_meet_sent:1
cluster_stats_messages_sent:4011
cluster_stats_messages_ping_received:1906
cluster_stats_messages_pong_received:1964
cluster_stats_messages_received:3870
total_cluster_links_buffer_limit_exceeded:0
```

---------

Signed-off-by: Harkrishn Patro <[email protected]>
Signed-off-by: Nitai Caro <[email protected]>
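As an illustration of how a monitoring script might consume the `key:value` output shown in this commit message, here is a minimal parsing sketch. It is not part of the change itself; the helper name `parse_cluster_info` is hypothetical, and it assumes one `key:value` pair per line with integer values wherever the value is all digits.

```python
from typing import Dict, Union


def parse_cluster_info(text: str) -> Dict[str, Union[int, str]]:
    """Parse CLUSTER INFO text into a dict, converting numeric values."""
    fields: Dict[str, Union[int, str]] = {}
    for line in text.splitlines():
        line = line.strip()
        if ":" not in line:
            continue  # skip blank lines and anything that is not key:value
        key, _, value = line.partition(":")
        fields[key] = int(value) if value.isdigit() else value
    return fields
```

With the output above, `parse_cluster_info(...)["cluster_voting_nodes_pfail"]` would be the integer `1`, while `cluster_state` stays the string `fail`.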
New fields in CLUSTER INFO:

* `cluster_nodes_pfail`
* `cluster_nodes_fail`
* `cluster_voting_nodes_pfail`
* `cluster_voting_nodes_fail`

I'm running a few tests and trying to capture the partially failed and completely failed node counts. Slot-level partially failed / completely failed stats exist, but it is more difficult to assess the node failure count with those.
New output: